ABSTRACT

Inference of new biological knowledge, e.g., prediction of protein 
function, from protein-protein interaction (PPI) networks has received 
attention in the post-genomic era. A popular strategy has been 
to cluster the network into functionally coherent groups of proteins 
and predict protein function from the clusters. Traditionally, network 
research has focused on clustering of nodes. However, why favor 
nodes over edges, when clustering of edges may be preferred? 
For example, nodes belong to multiple functional groups, but 
clustering of nodes typically cannot capture the group overlap, while 
clustering of edges can. Clustering of adjacent edges that share 
many neighbors was proposed recently, outperforming different node 
clustering methods. However, since some biological processes can 
have characteristic "signatures" throughout the network, not just 
locally, it may be of interest to consider edges that are not necessarily 
adjacent. Hence, we design a sensitive measure of the "topological 
similarity" of edges that can deal with edges that are not necessarily 
adjacent. We cluster edges that are similar according to our measure 
in different baker's yeast PPI networks, outperforming existing node 
and edge clustering approaches. 



1 INTRODUCTION 

A network (graph) consists of nodes and edges. Network research
spans many domains, and the biomedical domain is no exception. We
focus on protein-protein interaction (PPI) networks, in which nodes 
are proteins and undirected edges correspond to physical binding 
between the proteins. Of all biological networks, we focus on PPI 
networks since it is the proteins, gene products, that carry out 
most biological processes, and they do so by interacting with other 
proteins. High-throughput screens for interaction detection, such as
yeast two-hybrid (Y2H) assays or affinity purification coupled to
mass spectrometry (AP/MS), have yielded partial PPI networks for
many model organisms and human (Giot et al., 2003; Stelzl et al.,
2005; Yu et al., 2008; Simonis et al., 2009), as well as for bacterial
and viral pathogens (Parrish et al., 2007; LaCount et al., 2005).
Many biological network datasets are now publicly available
(Peri et al., 2004; Breitkreutz et al., 2008).

Analogous to genomic sequence research, biological network 
research is expected to have invaluable impacts on our biological 
understanding. However, unlike genomic sequence research, 
biological network research is in its infancy, owing to computational
hardness of many graph theoretic problems (Cook, 1971), as well
as to incompleteness of the available network data. Importantly,
the number of functionally uncharacterized proteins is large
even for simple and well studied model organisms (Sharan et al.,
2007). Functional characterization of proteins via computational
analysis could save resources needed for biological experiments.
In particular, PPI network analysis could help in suggesting
top candidates for future experimental validation, since proteins 
aggregate to perform a function, and since PPI networks model these 
aggregations. 

Thus, it is no surprise that prediction of protein function
(Sharan et al., 2007; Milenkovic & Przulj, 2008) and the role of
proteins in disease (Sharan & Ideker, 2008; Radivojac et al., 2008;
Goh et al., 2007; Milenkovic et al., 2010; Vanunu et al., 2010) from
PPI networks have received attention in the post-genomic era.
For example, it has been argued that proteins which are close
in the network are likely to be involved in similar biological
processes (Sharan et al., 2007), that "topologically central" proteins
correspond to "biologically central" (e.g., lethal, aging-, or
cancer-related) proteins (Jeong et al., 2001; Jonsson & Bates, 2006;
Sharan & Ideker, 2008; Milenkovic et al., 2011), or that proteins
with similar topological neighbourhoods have similar biological
characteristics (Milenkovic & Przulj, 2008; Ho et al., 2010).

A particularly popular strategy for functional characterization
of proteins has been to cluster the network into functionally
"coherent" groups of nodes and assign the entire cluster a
function based on the functions of its annotated members (Sharan et al.,
2007; Sharan & Ideker, 2008). A variety of clustering approaches
exist, each with its own (dis)advantages (Brohee & van Helden,
2006; Fortunato, 2010). Typically, they aim to group nodes
that are in a dense connected network region (Fortunato,
2010). Also, approaches exist that cluster "topologically similar"
nodes without the nodes necessarily being connected in the
network. This is important, since a biological process can have
characteristic topological "signatures" throughout the network, not
just locally in close network proximity (Milenkovic & Przulj, 2008;
Milenkovic et al., 2010; Ho et al., 2010). For example, we designed
a measure that computes the topological similarity of the extended
network neighborhoods of two nodes, without the nodes necessarily
being close in the network (Milenkovic & Przulj, 2008). We found
that 96% of known cancer gene pairs that are topologically similar
according to our measure are actually not neighbours in the PPI
network; instead, they are at the shortest path distance of up
to six (Milenkovic et al., 2010). As such, they may be missed
by approaches that focus on connected nodes only. We clustered
proteins in the human PPI network that are topologically similar
and showed that the function of a protein and its network position are
closely related (Milenkovic & Przulj, 2008) and that the topology
around cancer and non-cancer genes is different (Milenkovic et al.,
2010). We used these observations to predict new cancer genes in
melanogenesis-related pathways, and our predictions were validated
phenotypically (Ho et al., 2010).

Traditionally, network research has focused on clustering of
nodes (Fortunato, 2010). However, a network consists of nodes
and edges. Hence, why favor nodes over edges, especially when
clustering of edges may be preferred? For example, since nodes
typically belong to multiple functional groups, and since clusters are
expected to correspond to the functional groups, it may be desirable
to allow for a node to belong to multiple clusters. Clustering of
nodes typically cannot capture the group overlap, especially if the
network is partitioned into disjoint node sets, as is the case with
many (although not all) node clustering approaches (Fortunato,
2010; Ahn et al., 2010). However, clustering of edges can trivially
capture the group overlap (Fig. 1). Edge clustering methods
were proposed only recently (Ahn et al., 2010; Evans & Lambiotte,
2009). Adjacent (connected) edges that share many neighbors
were defined as similar and were thus clustered together (see
below), outperforming different node clustering methods, including
a method which allows for the group overlap (Ahn et al., 2010).
However, it may be of interest to consider edges that are not
necessarily adjacent (see above).

Fig. 1. Node clustering (left) versus edge clustering (right).
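
The claim that edge clustering captures group overlap can be illustrated with a minimal Python sketch. The toy network (two triangles sharing one node) and the function name `node_memberships` are hypothetical and serve illustration only; each edge belongs to exactly one cluster, yet a node inherits every cluster of its incident edges.

```python
from collections import defaultdict


def node_memberships(edge_clusters):
    """Map each node to the set of edge-cluster ids that its edges belong to."""
    membership = defaultdict(set)
    for cluster_id, edges in enumerate(edge_clusters):
        for u, v in edges:
            membership[u].add(cluster_id)
            membership[v].add(cluster_id)
    return dict(membership)


# Hypothetical toy network: two triangles that share node "c".
edge_clusters = [
    [("a", "b"), ("b", "c"), ("a", "c")],  # cluster 0
    [("c", "d"), ("d", "e"), ("c", "e")],  # cluster 1
]

memberships = node_memberships(edge_clusters)
# Nodes that fall into more than one cluster, i.e., the group overlap.
overlapping = sorted(n for n, ids in memberships.items() if len(ids) > 1)
print(overlapping)
```

Here the edge partition is disjoint, but node "c" is assigned to both clusters, which a disjoint node partition could not express.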



Hence, we introduce a new measure of edge similarity that is
not only capable of dealing with edges that are not necessarily
adjacent, but is also a more sensitive measure of topology than the
above shared-neighborhood measure (Ahn et al., 2010). For a fair
evaluation of our measure, when grouping edges that are similar
according to our measure, we precisely mimic the above edge
clustering approach by Ahn et al. (2010). We show that using our
measure results in clusters of comparable or better quality.



2 APPROACH 

We recently designed a graphlet-based measure of the topological
position of a node in the network; graphlets are small induced
subgraphs of a network (Fig. 2; Przulj, 2007). This measure
generalizes the degree of a node that counts the number of edges
that the node touches (where an edge is the only 2-node graphlet)
into the node graphlet degree vector (node-GDV) that counts the
number of different graphlets that the node touches, for all
2-5-node graphlets (Fig. 2; also, see Methods). Hence, node-GDV
describes the topology of the node's up to 4-deep neighborhood
(Milenkovic & Przulj, 2008). This is effective: going to distance of
four around a node captures a large portion of the network due to
the small-world nature of many real networks (Watts & Strogatz,
1998). For this reason, and since the number of graphlets on n
nodes increases exponentially with n, using larger graphlets could
unnecessarily increase the computational complexity of the method.
Also, we designed node-GDV-similarity measure to compare
node-GDVs of two nodes and hence quantify the topological similarity
of their extended network neighborhoods (Milenkovic & Przulj,
2008).

Since a graphlet consists of nodes and edges, we now design
edge-GDV to count the number of different 3-5-node graphlets that
an edge touches (Fig. 2). (We exclude the count for the only 2-node
graphlet, an edge, as each edge touches exactly one 2-node graphlet,
itself.) Also, we design edge-GDV-similarity to compare
edge-GDVs of two edges and hence quantify the topological similarity
of their extended network neighborhoods. Unlike the
shared-neighborhood measure (Ahn et al., 2010), edge-GDV-similarity can
measure similarity between edges independent of whether they are
adjacent. Also, by counting the shared neighbors of end nodes of
two (adjacent) edges, the shared-neighborhood measure actually
counts the 3-node paths that the end nodes share.
Since edge-GDV counts the different 3-5-node graphlets that an
edge touches, including a 3-node path, edge-GDV is a more
constraining measure of topology. See Methods for details.
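
To make the idea of an edge "touching" graphlets concrete, the following minimal Python sketch counts, for a single edge, only the two 3-node graphlets (the induced 3-node path and the triangle). It is an illustration under simplifying assumptions, not the full 3-5-node edge-GDV; the helper names and the toy network are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def three_node_edge_counts(adj, u, v):
    """Return (paths, triangles) for edge (u, v).

    triangles: nodes adjacent to both u and v.
    paths: nodes adjacent to exactly one of u and v, each of which forms
           an induced 3-node path that contains the edge (u, v).
    """
    nu = adj[u] - {v}
    nv = adj[v] - {u}
    triangles = len(nu & nv)
    paths = len(nu ^ nv)
    return paths, triangles


# Hypothetical toy network: a triangle a-b-c with a pendant node d on c.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
for edge in [("a", "b"), ("b", "c"), ("c", "d")]:
    print(edge, three_node_edge_counts(adj, *edge))
```

Even these two counts already separate edge (a, b), which lies only in a triangle, from edge (c, d), which lies only on 3-node paths.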

We evaluate our approach against existing clustering methods, 
as follows (also, see Methods). The existing edge clustering 
method mentioned above, henceforth denoted by edge-shared
neighborhood (edge-SN), was already shown to be superior to
different node clustering methods on four baker's yeast PPI
networks (Ahn et al., 2010). For a fair evaluation, we mimic edge-SN
exactly, except that we use edge-GDV-similarity instead of the
shared-neighborhood measure as the distance metric for the same
clustering method, namely hierarchical clustering. Just as edge-SN,
we (initially) cluster only adjacent edges, and of all partitions, we
choose the one with the maximum density (see Methods). Just as
edge-SN, we evaluate such partition with respect to four measures:
cluster coverage (the portion of the network "covered" by
"non-trivial" clusters), overlap coverage (the amount of node overlap
between clusters), cluster quality (enrichment of clusters in Gene
Ontology (GO) terms (Ashburner et al., 2000)), and overlap quality
(the correlation between the number of clusters and the number of
GO terms that nodes participate in). When applied to the same yeast
networks, our approach is comparable or superior to edge-SN (and
hence to the node clustering approaches that were outperformed by 
edge-SN on the same networks). Thus, we gain by using a more 
sensitive measure of topology compared to edge-SN. Furthermore, 
when we cluster both adjacent and non-adjacent edges, our method 
in general performs even better. Hence, we gain additionally by 
using a measure that can capture similarity of edges that are not 
necessarily adjacent. We note that we do not propose a new edge 
clustering method but a new edge similarity measure that can serve 
as a distance metric for existing clustering methods. 



3 METHODS 

3.1 Data sets 

We cluster the same four baker's yeast PPI networks that edge-SN was
evaluated on (Ahn et al., 2010; Yu et al., 2008): 1) Y2H network, obtained
by Y2H, with 1,647 proteins and 2,518 PPIs; 2) AP/MS network, obtained 
by AP/MS, with 1,004 proteins and 8,319 PPIs; 3) LC network, obtained by 
literature curation, with 1,213 proteins and 2,556 PPIs; and 4) ALL network, 
representing the union of Y2H, AP/MS, and LC, with 2,729 proteins and 
12,174 PPIs. Using these different networks ensures that our method is 
robust to different types of experiments for PPI detection. 

3.2 Related work 

We compare our method to three popular node clustering methods: clique
percolation (Palla et al., 2005), greedy modularity optimization (Newman,
2004), and Infomap (Rosvall & Bergstrom, 2008). Also, we compare it
to the existing edge clustering algorithm, edge-SN (Ahn et al., 2010).
Briefly, clique percolation is the most prominent overlapping node clustering
algorithm, greedy modularity optimization is the most popular
modularity-based technique, and Infomap is often considered the most accurate method
available (Ahn et al., 2010). Edge-SN hierarchically groups adjacent edges
whose non-common end-nodes share many neighbors (see below). We did
not run these algorithms on the yeast networks ourselves. Instead, we use
the results reported by Ahn et al. (2010), who ran the algorithms on the same
networks. For details on how the methods were implemented, see Ahn et al.
(2010). We do explain how edge-SN was implemented, as we implement our
method in the same way (except that we use a different distance metric).

Edge-SN algorithm works as follows. If the set of node $i$ and its neighbors
is denoted as $n(i)$, the similarity between adjacent edges $e_{ik}$ and $e_{jk}$ with
common node $k$ is $S(e_{ik}, e_{jk}) = |n(i) \cap n(j)| / |n(i) \cup n(j)|$. This
shared-neighborhood measure is used as a distance metric for single-linkage
hierarchical clustering. With this method, a tree, or dendrogram, is created.
Leaves of the tree are edges of the network and an interior node in the
tree represents a cluster made up of all children of the node. The tree is
constructed by assigning each edge to its own cluster and iteratively merging
the most similar pair of clusters. The tree has to be cut in order to create a
partition of $K$ clusters. To determine where to cut the tree, edge-SN uses
an objective function called partition density, computed as follows. For a
network with $M$ edges, $\{P_1, \ldots, P_K\}$ is a partition of the edges into $K$
clusters. Cluster $C$ has $m_C = |C|$ edges and $n_C = |\bigcup_{e_{ij} \in C} \{i, j\}|$ nodes.
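$C$'s density is $D_C = \frac{m_C - (n_C - 1)}{n_C (n_C - 1)/2 - (n_C - 1)}$
and the partition density is
$D = \frac{1}{M} \sum_{C} m_C D_C = \frac{2}{M} \sum_{C} m_C \frac{m_C - (n_C - 1)}{(n_C - 2)(n_C - 1)}$.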

Edge-SN cuts the tree at different levels and chooses a partition with the 
maximum value of D. However, meaningful structure may also exist above
and below the level corresponding to maximum D (Ahn et al., 2010).
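
The two quantities that drive edge-SN can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Ahn et al. (2010): it assumes the partition density takes the form D = (2/M) * sum over clusters of m_C (m_C - (n_C - 1)) / ((n_C - 2)(n_C - 1)), and it assumes that clusters with only two nodes contribute zero. The function names and the toy network are hypothetical.

```python
def shared_neighborhood_similarity(adj, i, j):
    """Jaccard index of the inclusive neighborhoods n(i) and n(j).

    Intended for the non-common end nodes i and j of two adjacent edges
    e_ik and e_jk that share node k.
    """
    n_i = adj[i] | {i}
    n_j = adj[j] | {j}
    return len(n_i & n_j) / len(n_i | n_j)


def partition_density(edge_clusters):
    """Partition density D of an edge partition (assumed formula, see above)."""
    m_total = sum(len(cluster) for cluster in edge_clusters)
    total = 0.0
    for cluster in edge_clusters:
        m_c = len(cluster)
        n_c = len({node for edge in cluster for node in edge})
        if n_c > 2:  # assumption: two-node clusters contribute zero
            total += m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))
    return 2.0 * total / m_total


# Hypothetical toy network: two triangles that share node "c".
adj = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
}
same_triangle = shared_neighborhood_similarity(adj, "a", "b")   # edges a-c, b-c
across_triangles = shared_neighborhood_similarity(adj, "b", "d")  # edges b-c, d-c
density = partition_density([
    [("a", "b"), ("b", "c"), ("a", "c")],
    [("c", "d"), ("d", "e"), ("c", "e")],
])
print(same_triangle, across_triangles, density)
```

Edges within a triangle are maximally similar, edges from different triangles are not, and the partition into the two triangles reaches the highest possible density, since each cluster is a clique.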

3.3 New measures of network topology: edge graphlet 
degree vector (edge-GDV) and edge-GDV-similarity 

A graphlet is an induced subgraph of graph $X$ that contains all edges of
$X$ connecting its nodes (Fig. 2). We generalized the degree of node $v$ that
counts the number of edges that $v$ touches (where an edge is the only 2-node
graphlet, $G_0$ in Fig. 2) into node graphlet degree vector (node-GDV) of $v$
that counts the number of 2-5-node graphlets ($G_0, G_1, \ldots, G_{29}$ in Fig. 2)
that $v$ touches (Milenkovic & Przulj, 2008). We need to distinguish between
$v$ touching, e.g., a $G_1$ at an end node or at the middle node, since $G_1$ admits
an automorphism that maps its end nodes to each other and the middle node
to itself. To understand this, recall the following. An isomorphism $f$ from
graph $X$ to graph $Y$ is a bijection of nodes of $X$ to nodes of $Y$ such that $xy$
is an edge of $X$ if and only if $f(x)f(y)$ is an edge of $Y$. An automorphism
is an isomorphism from $X$ to itself. The automorphisms of $X$ form the
automorphism group, $\mathrm{Aut}(X)$. If $x$ is a node of $X$, then the automorphism
node orbit of $x$ is $\mathrm{Orb}(x) = \{y \in V(X) \mid y = f(x) \text{ for some } f \in
\mathrm{Aut}(X)\}$, where $V(X)$ is the set of nodes of $X$. Thus, end nodes of a
$G_1$ belong to one node orbit, while its middle node belongs to another one.
There are 73 node orbits for 2-5-node graphlets. Hence, node-GDV of $v$ has
73 elements counting how many node orbits of each type touch $v$ ($v$'s degree
is the first element). It captures $v$'s up to 4-deep neighborhood and thus a
large portion of real networks, as they are small-world (Watts & Strogatz, 1998).
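
The definition of node orbits can be checked directly on small graphs. The Python sketch below enumerates automorphisms by brute force over all node permutations, which is feasible only for graphs as small as graphlets; it is an illustration of the definition rather than an efficient algorithm, and the function names are hypothetical.

```python
from itertools import permutations


def automorphisms(nodes, edges):
    """Return all automorphisms of a small undirected graph as dicts."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(nodes):
        f = dict(zip(nodes, perm))
        # A bijection is an automorphism iff it maps the edge set onto itself.
        if {frozenset((f[x], f[y])) for x, y in edges} == edge_set:
            result.append(f)
    return result


def node_orbits(nodes, edges):
    """Group nodes into orbits: Orb(x) = {f(x) for f in Aut(X)}."""
    autos = automorphisms(nodes, edges)
    orbits = {frozenset(f[x] for f in autos) for x in nodes}
    return sorted(sorted(orbit) for orbit in orbits)


# The 3-node path: end nodes 0 and 2, middle node 1.
path3_orbits = node_orbits([0, 1, 2], [(0, 1), (1, 2)])
print(path3_orbits)
```

For the 3-node path, the two end nodes share one orbit and the middle node forms its own, in agreement with the text.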

Since a graphlet contains nodes and edges, we propose a new
graphlet-based measure of the topological position of an edge in the network. We
define edge-GDV to count the number of graphlets that an edge touches at
a given "edge orbit" (Fig. 2). We define edge orbits as follows. Given the
automorphism group of graph $X$, $\mathrm{Aut}(X)$, if $xy$ is an edge of $X$, then
the edge orbit of $xy$ is $\mathrm{Orb}_e(xy) = \{zw \in E(X) \mid z = f(x) \text{ and } w =
f(y) \text{ for some } f \in \mathrm{Aut}(X)\}$, where $E(X)$ is the set of edges of $X$. For
example, in Fig. 2, in a $G_1$, both edges are in edge orbit 1. In a $G_2$, all three
edges are in edge orbit 2. In a $G_3$, the two "outer" edges are in edge orbit
3, while the "middle" edge is in edge orbit 4. And so on. There are 68 edge
orbits for 3-5-node graphlets. (We intentionally exclude the orbit in the only
2-node graphlet, $G_0$, as each edge touches exactly one $G_0$, namely itself.)

Fig. 2. All the connected graphs on 2 to 5 nodes. When appearing as
induced subgraphs of a network, they are called graphlets. They contain
73 topologically unique node types, or "node orbits." In a graphlet, nodes in
the same node orbit are of the same shade (Przulj, 2007). They also contain
69 topologically unique edge types, or "edge orbits." (3-5-node graphlets 
contain 68 edge orbits.) Edge orbit of an edge is defined by node orbits of its 
end nodes. In a graphlet, different edge orbits are numbered differently. 
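
The edge orbits defined in Section 3.3 can likewise be verified by brute force on small graphs. The Python sketch below is a minimal illustration of the definition (all node permutations are tried, so it is suitable only for graphlet-sized graphs); the function names are hypothetical.

```python
from itertools import permutations


def automorphisms(nodes, edges):
    """Return all automorphisms of a small undirected graph as dicts."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(nodes):
        f = dict(zip(nodes, perm))
        if {frozenset((f[x], f[y])) for x, y in edges} == edge_set:
            result.append(f)
    return result


def edge_orbits(nodes, edges):
    """Group edges into orbits: Orb_e(xy) = {f(x)f(y) for f in Aut(X)}."""
    autos = automorphisms(nodes, edges)
    orbits = set()
    for x, y in edges:
        orbits.add(frozenset(frozenset((f[x], f[y])) for f in autos))
    # Convert to sorted lists of sorted node pairs for stable printing.
    return sorted(sorted(tuple(sorted(e)) for e in orbit) for orbit in orbits)


# 4-node path: two "outer" edges and one "middle" edge.
path4 = edge_orbits([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
# Triangle: all three edges are equivalent.
triangle = edge_orbits([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
print(path4)
print(triangle)
```

The 4-node path yields two edge orbits (outer and middle edges), and the triangle yields a single orbit, matching the examples given in the text.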



Comparing edge-GDVs of two edges gives a sensitive measure of their 
topological similarity, since their extended network neighborhoods are 
compared. Using some existing measure, e.g., Euclidean distance, to 
compare edge-GDVs might be inappropriate, as some edge orbits are not 
independent. Instead, we design edge-GDV-similarity measure as follows. 



For an edge $e$, $e_i$ is the $i^{th}$ element of its edge-GDV. The distance
between the $i^{th}$ edge orbits of edges $e$ and $f$ is
$D_i(e,f) = w_i \times \frac{|\log(e_i+1) - \log(f_i+1)|}{\log(\max\{e_i, f_i\}+2)}$,
where $w_i$ is the weight of edge orbit $i$ that accounts
for edge orbit dependencies. For example, the differences in counts of orbit
2 of two edges will imply the differences in counts of all other orbits
that contain orbit 2, such as orbits 8-12 (Fig. 2). This is applied to all
edge orbits: the smaller the number of orbits that affect orbit $i$ (including
itself), $o_i$, the higher its weight $w_i$, where
$w_i = 1 - \frac{\log(o_i)}{\log(68)}$. Clearly,
$w_i$ is in (0,1] and the highest weight of 1 is assigned to orbit $i$ with
$o_i = 1$. The log is used in the formula for $D_i$ because the $i^{th}$ elements
of two edge-GDVs can differ by several orders of magnitude and we do
not want the distance between edge-GDVs to be dominated by large values;
also, we want to account for the relative difference between $e_i$ and $f_i$
and that is why we divide by the value of the denominator, which also
scales $D_i$ to [0, 1). The constants are added to prevent $D_i$ from being infinite.
The total distance is $D(e,f) = \frac{\sum_{i=1}^{68} D_i}{\sum_{i=1}^{68} w_i}$. Finally, edge-GDV-similarity
is $S(e,f) = 1 - D(e,f)$. It is in (0, 1]. The higher the
edge-GDV-similarity, the higher the topological similarity of edges' extended network
neighborhoods. We design edge-GDV-similarity as described because we
already designed node-GDV-similarity, which compares node-GDVs, in
a similar way (Milenkovic & Przulj, 2008), and because we showed in
different contexts that node-GDV-similarity successfully extracts function
from network topology (Milenkovic et al., 2010; Memisevic et al., 2010;
Kuchaiev et al., 2010; Milenkovic et al., 2010, 2011). So, we expect
edge-GDV-similarity to successfully extract function from topology as well.
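
The similarity computation can be sketched in Python as follows. This is a minimal illustration that assumes the formulas take the same form as in node-GDV-similarity (Milenkovic & Przulj, 2008): D_i = w_i * |log(e_i + 1) - log(f_i + 1)| / log(max(e_i, f_i) + 2), w_i = 1 - log(o_i) / log(68), D = sum(D_i) / sum(w_i), and S = 1 - D. The dependency counts o_i are not listed in the text, so the values used below, as well as the shortened 3-element vectors, are hypothetical.

```python
import math

NUM_EDGE_ORBITS = 68  # edge orbits of 3-5-node graphlets


def orbit_weight(o_i, num_orbits=NUM_EDGE_ORBITS):
    """w_i = 1 - log(o_i) / log(num_orbits); o_i counts orbits affecting orbit i."""
    return 1.0 - math.log(o_i) / math.log(num_orbits)


def edge_gdv_similarity(gdv_e, gdv_f, dependency_counts):
    """Return S(e, f) = 1 - D(e, f) for two edge-GDVs of equal length."""
    weights = [orbit_weight(o_i) for o_i in dependency_counts]
    total = 0.0
    for e_i, f_i, w_i in zip(gdv_e, gdv_f, weights):
        numerator = abs(math.log(e_i + 1) - math.log(f_i + 1))
        denominator = math.log(max(e_i, f_i) + 2)
        total += w_i * numerator / denominator
    return 1.0 - total / sum(weights)


# Hypothetical, shortened example: 3 orbits instead of 68.
dependency_counts = [1, 1, 2]   # assumed o_i values, for illustration only
gdv_e = [10, 3, 0]
gdv_f = [12, 0, 5]
print(edge_gdv_similarity(gdv_e, gdv_e, dependency_counts))
print(edge_gdv_similarity(gdv_e, gdv_f, dependency_counts))
```

Identical vectors give a similarity of exactly 1, and the log scaling keeps the difference between counts of 10 and 12 from outweighing the difference between counts of 0 and 5.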

3.4 Our clustering strategies 

We cluster the yeast PPI networks in the same manner as edge-SN, except 
that we use edge-GDV-similarity as the distance metric instead of using
the shared-neighborhood measure. Initially, for a fair comparison with edge- 
SN, we cluster adjacent edges only, to test if and how much we gain by 
using our more sensitive measure of edge similarity. Later on, we cluster all 
edges, to test if and how much we gain by taking into account edges that are
not necessarily adjacent. Some further information is provided below, after 
defining measures of partition quality. 

3.5 Quality of partitions 

We evaluate a partition with respect to the same measures that were used 
by edge-SN: cluster coverage (CC), overlap coverage (OC), cluster quality 
(CQ), and overlap quality (OQ). CC is the fraction of nodes that belong to 
at least one "non-trivial" cluster of three or more nodes. OC is the average 
number of non-trivial clusters that nodes belong to. CQ is the ratio of the 
average Gene Ontology (GO) term (Ashburner et al., 2000) similarity over
all node pairs that are in at least one same cluster and the average GO term
similarity over all node pairs in the network. OQ is the mutual information
between the number of GO terms and the number of non-trivial clusters that
proteins are involved in. Raw values for the four measures do not necessarily
fall in [0, 1]. Hence, just as Ahn et al. (2010), we normalize each measure
such that the best method has a value of one. Then, the overall partition 
quality is the sum of these four normalized measures, such that the maximum 
achievable score is four. 
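
The normalization step can be sketched in Python as follows. This is a minimal illustration that assumes "normalize" means dividing each raw measure by the best (largest) value achieved by any method and that all best values are positive; the method names and raw values are hypothetical and are not results from this study.

```python
MEASURES = ("CC", "OC", "CQ", "OQ")


def overall_partition_quality(raw_scores):
    """Sum of the four measures after scaling each so the best method scores one.

    raw_scores: {method_name: {"CC": ..., "OC": ..., "CQ": ..., "OQ": ...}}
    Assumes the best raw value of every measure is positive.
    """
    best = {m: max(scores[m] for scores in raw_scores.values()) for m in MEASURES}
    return {
        method: sum(scores[m] / best[m] for m in MEASURES)
        for method, scores in raw_scores.items()
    }


# Hypothetical raw values for two methods (illustration only).
raw_scores = {
    "method_A": {"CC": 0.8, "OC": 1.2, "CQ": 2.0, "OQ": 0.1},
    "method_B": {"CC": 0.4, "OC": 2.4, "CQ": 4.0, "OQ": 0.2},
}
quality = overall_partition_quality(raw_scores)
print(quality)
```

A method reaches the maximum score of four only if it is the best method with respect to all four measures simultaneously.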

We can now note the following. To mimic Ahn et al. (2010), we would
report for a given network the partition with maximum partition density
D. However, we find that CC is strongly negatively correlated with CQ
and OQ, and sometimes with OC, over all of our partitions (Fig. 3). Thus,
choosing the partition with low CC would result in high CQ and OQ (and 
sometimes OC), hence artificially increasing the overall partition quality. 
Since in three out of four yeast networks CC is lower for edge-SN than for the 
node clustering methods, it might not be surprising that edge-SN's overall 
partition quality is the highest. Analogously, since edge-SN's partitions 
with maximum D have lower CC than our partitions with maximum D, 
our partitions may have lower overall partition quality simply because of 
the strong negative correlation between CC and other measures. Hence, 
we find the partition with maximum D among all partitions that have CC 
less than or equal to CC of edge-SN's partition with maximum D. Then, 
we report either the partition obtained in this way or the partition with 
maximum D (independent of its CC), whichever has better overall partition 
quality. Furthermore, when we cluster both adjacent and non-adjacent edges,
selecting the partition based on its density, as just described, might be 
inappropriate (see above). Thus, when we cluster both adjacent and non- 
adjacent edges, we also report the partition with the best overall partition 
quality. 
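
The density-based selection rule described above can be sketched in Python as follows. This is one possible reading of the rule, not a definitive implementation: each candidate partition is represented by a hypothetical dictionary holding its density, its CC, and a precomputed overall quality, and the fallback used when no candidate satisfies the CC constraint is an assumption.

```python
def select_partition(candidates, reference_cc):
    """Pick a partition following the density-based rule sketched in the text.

    candidates: list of dicts with keys "density", "cc", and "quality"
                (hypothetical representation of dendrogram cuts).
    reference_cc: CC of edge-SN's maximum-density partition.
    """
    best_density = max(candidates, key=lambda p: p["density"])
    constrained = [p for p in candidates if p["cc"] <= reference_cc]
    if not constrained:
        # Assumption: fall back to the unconstrained maximum-density cut.
        return best_density
    best_constrained = max(constrained, key=lambda p: p["density"])
    # Report whichever of the two has the better overall partition quality.
    return max((best_constrained, best_density), key=lambda p: p["quality"])


# Hypothetical candidate cuts of one dendrogram (illustration only).
candidates = [
    {"name": "p1", "density": 0.5, "cc": 0.90, "quality": 2.0},
    {"name": "p2", "density": 0.4, "cc": 0.30, "quality": 3.0},
    {"name": "p3", "density": 0.2, "cc": 0.20, "quality": 3.5},
]
chosen = select_partition(candidates, reference_cc=0.35)
print(chosen["name"])
```

In this toy case p3 has the highest overall quality but is not reported, because the rule ranks candidates by density and uses quality only to choose between the constrained and the unconstrained maximum-density partitions.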

4 RESULTS 

We evaluate our method against three node clustering methods 
(clique percolation - CliqPerc, greedy modularity optimization - 
GreedMod, and Infomap) and one edge clustering method (edge- 
SN) on four yeast PPI networks (Y2H, AP/MS, LC, and ALL), with 
respect to four measures of partition quality (cluster coverage - CC, 
overlap coverage - OC, cluster quality - CQ, and overlap quality 
- OQ) that are combined into the normalized overall partition 
quality; see Methods. We denote our method when clustering 
adjacent edges only and reporting the partition with the maximum 
density as eGDV-A-D. We denote our method when clustering both 
adjacent and non-adjacent edges and reporting the partition with the 
maximum density as eGDV-NA-D. We denote our method when
clustering both adjacent and non-adjacent edges and reporting the
partition with the best overall partition quality as eGDV-NA-B. See
Methods for details. Results are shown in Fig. 4.

We gain by using edge-GDV-similarity for clustering: eGDV-A-D
outperforms all node clustering approaches on all networks.
(This includes node clustering by using node-GDV-similarity, as
shown in Fig. 5.) Also, it outperforms edge-SN on Y2H and
AP/MS. Although edge-SN is slightly better than and comparable
to eGDV-A-D on LC and ALL networks, respectively, eGDV-NA-D
outperforms edge-SN on these two networks, as well as on AP/MS. 
Hence, we gain further by clustering non-adjacent edges in addition 
to adjacent ones. The only exception is Y2H, for which edge-SN is 
slightly better than eGDV-NA-D. However, as already noted, eGDV- 
A-D outperforms edge-SN on Y2H network. Hence, we are always 
superior, with either eGDV-A-D or eGDV-NA-D or both eGDV-A- 
D and eGDV-NA-D. With eGDV-NA-B, we further demonstrate our 
superiority over all other methods on all networks. 

Fig. 4. Method comparison for (A) Y2H, (B) AP/MS, (C) LC, and (D) ALL yeast PPI networks. The following methods are compared: clique percolation
(CliqPerc), greedy modularity optimization (GreedMod), Infomap, edge-SN, our method when clustering adjacent edges only and choosing the partition
with the maximum density (eGDV-A-D), our method when hierarchically clustering both adjacent and non-adjacent edges and choosing the partition with
the maximum density (eGDV-NA-D), and our method when hierarchically clustering both adjacent and non-adjacent edges and choosing the partition with
the best overall partition quality (eGDV-NA-B). Clustering methods are compared with respect to the following measures: cluster coverage (CC), overlap
coverage (OC), cluster quality (CQ), and overlap quality (OQ). The overall partition quality score (y-axis) is the sum of these four measures after each is
normalized to [0,1], such that the maximum achievable score is four. 

Fig. 5. Comparison of our node and edge clustering methods on the four yeast PPI networks (Y2H, LC, AP/MS, and ALL). nGDV-A denotes the node
clustering method when node-GDV-similarity is used as the distance metric for hierarchical clustering of adjacent nodes only and the partition with the 
maximum partition density is selected. nGDV-NA denotes the node clustering method when node-GDV-similarity is used as the distance metric for hierarchical 
clustering of both adjacent and non-adjacent nodes and the partition with the best overall partition quality is selected. eGDV-A denotes the edge clustering 
method when edge-GDV-similarity is used as the distance metric for hierarchical clustering of adjacent edges only and the partition with the maximum partition 
density is selected. eGDV-NA denotes the edge clustering method when edge-GDV-similarity is used as the distance metric for hierarchical clustering of both 
adjacent and non-adjacent edges and the partition with the best overall partition quality is selected. The clustering methods are compared with respect to the 
following measures: cluster coverage (CC), overlap coverage (OC), cluster quality (CQ), and overlap quality (OQ). The overall partition quality score (y-axis) 
is the sum of these four measures after each is normalized to [0,1], such that the maximum achievable score is four. We compare our approach when using 
edge-GDV-similarity as the distance metric for edge clustering against our approach when using node-GDV-similarity as the distance metric for node clustering 
since we want to answer if and how much we gain by clustering of edges compared to clustering of nodes. And to answer this, one should use conceptually 
similar edge and node clustering methods, such as these. The figure shows that in each network: 1) we gain by clustering both adjacent and non-adjacent nodes 
compared to clustering only adjacent nodes; 2) we gain further by clustering adjacent edges instead of clustering nodes; and 3) we gain the most by clustering 
both adjacent and non-adjacent edges. 






